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£2 (57) Abstract: A microprocessor includes a first processing core, a first cryptographic co-processor and an integer multiplier unit 
thai is coupled to the first processing core and the first cryptographic co-processor. The first processing core includes an instruction 
decode unit, an instruction execution unit, a load/store unit. The first cryptographic co-processor is located on a first die with the 
first processing core. The RrsL cryptographic co-processor includes a cryptographic control register, a direct memory access engine 
thai is coupled to the Uiad/sLore unit in the first processing core and a cryptographic memory. 
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1 

Stream Processor with Cryptographic Co-Processor 

Background of the Invention 

5 

I, Field of the Invention 

[1] The present invention relates generally to microprocessors, and more particularly, to 
microprocessors that include a cryptographic co-processor within the microprocessor die. 

1 0 2. Description of the Related Art 

[2] Server computers (i.e., servers) process all sorts of data transactions. One common 
type of data transaction is an encrypted data transaction that typically requires the server to 
perform at least one of an encryption function and a decryption function. Figure 1 shows a 
typical server 102 and client computer 110 that are linked by a network 104, such as the 
1 5 Internet or other network. 

P] Figure 2 is a high-level block diagram of a typical server 102. As shown, the server 
102 includes a processor 202, ROM 204, and RAM 206, each connected by a peripheral bus 
system 208. The peripheral bus system 208 may include one or more buses connected to 
each other through various bridges, controllers and/or adapters, such as are well known in the 
20 art. For example, the peripheral bus system 208 may include a "system bus" that is 
connected through an adapter to one or more expansion buses, such as a Peripheral 
Component Interconnect (PCI) bus. Also coupled to the peripheral bus system 208 are a 
mass storage device 2 1 0, a network interface 2 12, a number (N) of input/output (I/O) devices 
216-1 through 216-N and a peripheral cryptographic processor 220. 

25 [41 I/O devices 216-1 through 216-N may include, for example, a keyboard, a pointing 
device, a display device and/or other conventional I/O devices. Mass storage device 2 1 0 may 
include any suitable device for storing large volumes of data, such as a magnetic disk or tape, 
magneto-optical (MO) storage device, or any of various types of Digital Versatile Disk 
(DVD) or Compact Disk (CD) based storage. 
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[5] The peripheral cryptographic processor 220 (i.e., crypto-processor) is linked to the 
processor 202 by the peripheral bus system 208. The crypto-processor 220 performs 
encryption and decryption operations that may be necessary for encrypted data transactions 
such as between the server 102 and the client 1 10. In some servers the crypto-processor 220 
5 can also be external to the server 102 and linked to the processor 202 by one of the I/O 
devices 216-1 through 216-N. 

[6] Network interface 2 1 2 provides data communication between the computer system 
and other computer systems on the network 104. Hence, network interface 212 may be any 
device suitable for or enabling the server 102 to communicate data with a remote processing 
10 system (e.g., client computer 1 1 0) over a data communication link, such as a conventional 
telephone modem, an Integrated Services Digital Network (ISDN) adapter, a Digital 
Subscriber Line (DSL) adapter, a cable modem, a satellite transceiver, an Ethernet adapter, or 
the like. 

[7] Typically the processor 202 can operate at clock speeds of up to or more than 1 GHz. 

15 Conversely, the peripheral bus system 208 typically operates at a substantially slower speed 
such as about 166 MHz or similar. Further, the crypto-processor 220 typically operates at a 
speed similar to the peripheral bus system 208. This is because the crypto-processor 220 
cannot process data any faster than the data can be transported across the peripheral bus 
system 208. Further, the crypto-processor 220 is typically a customized, specialized 

20 processor (i.e. an application specific integrated circuit (ASIC)) that may not be made by the 
latest, highest performance manufacturing technologies and therefore the maximum 
processing speed (i.e., the crypto-processor clock speed) of the crypto-processor 220 is 
substantially less than the maximum processing speed of the processor 202. 

|8| Figure 3 is a flowchart diagram of the method operations 300 of a typical encrypted 
25 data transaction within the server 102. Hie encrypted data transaction can be any data 

transaction that required encryption, decryption or both encryption and decryption such as an 
e-commerce transaction between the server 102 and the client computer 110. In operation 
305, data is received in the server 102 such as from the client computer 1 10 or because of a 
request by the client computer \ 10. 
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[9] In operation 3 1 0, the received data is analyzed to determine if the received data is 
encrypted. For example, the data may be encrypted because the data includes a user's 
personal and/or financial data or other data that is transported during an encrypted session 
such as SSL (secure sockets layer) or other encryption methods. 

5 f 10] If the received data is found to not be encrypted data, in operation 310, then the 
received data is processed as described in operation 330 below. Alternatively, if, in 
operation 310, the received data is determined to be encrypted data, then, in operation 315, 
the encrypted data is sent to the peripheral crypto processor 220 via the peripheral bus system 
208. 

10 [1 1] In operation 320, the crypto processor 220 decrypts the encrypted data. In operation 
325, the crypto processor 220 outputs the decrypted data to the processor 202 via the 
peripheral bus system 208. In operation 330, the processor 202 processes the data to produce 
result data. 

[1 2 J In operation 335, the result data is analyzed to determine if the result data should be 
15 encrypted. If the result data does not require encryption, then the processor outputs the result 
data to the client 1 10, in operation 340, and the method operations end. Alternatively, if, in 
operation 335, the result data required encryption, then in operation 345, the processor 
outputs the result data to the crypto-processor via the peripheral bus system 208. 

[131 hi operation 350, the crypto processor 220 encrypts the result data. In operation 355, 
20 die crypto processor 220 outputs the encrypted result data to the processor 202 via the 

peripheral bus system 208. In operation 360, the processor outputs the encrypted result data 
to the client 1 1 0 and the method operations end. 

[14] Transferring the data to be encrypted, decrypted or processed between the crypto 
processor 220 and the processor 202 is very slow. Further, the slower processing speed of 
25 the crypto processor 220 also limits the rate at which the data is encrypted or decrypted. 
Further, if a large volume of data such as streaming data (e.g., streaming audio, streaming 
video, etc.) is being encrypted and/or decrypted then the rate the server 102 can serve the 
streaming data is limited by the rate at which the streaming data can be encrypted and/or 
decrypted. Further still, the multiple transfers of the streaming data between the crypto 



PAGE 7(13 " RCVD AT 1/25/2007 7:28:55 PM [Eastern Standard Time] * SVR:USPTO-EFXRF-2/13 * DNI3:2738300 * CSID:<661) 460-1986 



* DURATION (mm-ss):06-14 



1/25/2007 5:29 PM FROM: (661) 460-1986 Huffman Patent Group, LLC TO: 1-571-273-8300 PAGE: 008 OF 025 



WO 03/036508 PCT/US02/33321 

4 

processor 220 and the processor 202 can dominate the usage of the peripheral bus system 208 
and the I/O systems inside the crypto processor 220 and the processor 202, thereby limiting 
further the ability of the processor 202 to perform any functions other than transferring data 
to and from the crypto processor 220. 

5 [151 hi vicw °f me foregoing, there is a need for a system and method for increased and/or 
more efficient data encryption and decryption process speeds. 

Summary of the Invention 

[16] Broadly speaking, the present invention fills these needs by a system and method for 
increased and/or more efficient data encryption and decryption process speeds. It should be 
10 appreciated that the present invention can be implemented in numerous ways, including as a 
process, an apparatus, a system, computer readable media, or a device. Several inventive 
embodiments of the present invention are described below. 

[17J One embodiment includes a microprocessor includes a first processing core, a first 
cryptographic co-processor and an integer multiplier unit that is coupled to the first 

1 5 processing core and the cryptographic co-processor. The first processing core includes an 
instruction decode unit, an instruction execution unit, a load/store unit. The first 
cryptographic co-processor is located on a first die with the first processing core. The 
cryptographic co-processor includes a cryptographic control register, a direct memory access 
engine that is coupled to the load/store unit in the first processing core and a cryptographic 

20 memory. 

[18] The integer multiplier unit can be included within the first processing core or within 
the first cryptographic co-processor. 

[19] The cryptographic memory is at least large enough to perform a Montgomery 
multiplication function. 

25 [20) In one embodiment, the integer multiplier unit is a 64-bit X 64-bit multiplier unit. 

[21] The load/store unit can be coupled to a main memory system hierarchy. 

[22] The first processing core is coupled to a second processing core by a processor 
crossbar. The second processing core is coupled to a second cryptographic co-processor that 
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is located on a second die with the second processing core. Alternatively, the second 
processing core and the second cryptographic co-processor can be located on the first die. 

[231 Th e fi fSt cryptographic co-processor can be coupled to the instruction decoder unit. 

|24J The first cryptographic co-processor and the first processing core share the integer 
5 multiplier unit 

|25J The direct memory access engine can be coupled to the load/store unit by a 64-bit 
data bus. 

[26] The cryptographic control register can include data that identifies a type of 
cryptographic instruction received in the first cryptographic co-processor. 

10 [27] One alternative embodiment includes a method of executing a cryptographic 

command. A cryptographic instruction is received in an load store unit in a processing core 
on a first die. The cryptographic instruction is analyzed to determine if the cryptographic 
instruction is a crypto store instruction. If the cryptographic instruction is a crypto store 
instruction, then a source operand of the crypto store instruction is stored in a crypto control 

15 register in a cryptographic co-processor on the first die. The source operand is analyzed to 
determine if the source operand identifies a corresponding crypto command. If the source 
operand identifies the corresponding crypto command, the corresponding crypto command is 
executed. 

[28] The cryptographic co-processor can also send an interrupt to an instruction execution 
20 unit that is included in the processing core such as when execution of a crypto command is 
completed. 

[29] A result of the cryptographic instruction can also be output to a memory system using 
a load store unit that is included in the processing core. 

[30] Execution of the cryptographic instruction in the cryptographic co-processor can also 
25 include accessing data through the load store unit. 

[31] The cryptographic co-processor can also include a direct memory access engine. 
Accessing data can also include loading and storing data in a main memory. 
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(32) Executing the cryptographic instruction in the cryptographic co-processor can also 
include executing a multiplication function. The cryptographic co-processor can include an 
integer multiplier unit for executing the multiplication function. 

(331 The various embodiments of the present invention provide the ability for a crypto 
5 processor to rapidly encrypt and/or decrypt data, such as streaming data, at rates much greater 
than possible by a prior art crypto processor. 

[34] Other aspects and advantages of the invention will become apparent from the 
following detailed description, taken in conjunction with the accompanying drawings, 
illustrating by way of example the principles of the invention. 

io Brief Description of the Drawings 

[35] The present invention will be readily understood by the following detailed description 
in conjunction with the accompanying drawings, and like reference numerals designate like 
structural elements. 

[36] Figure 1 shows a typical server and client computer that are linked by a network, such 
1 5 as the Internet or other network. 

[37] Figure 2 is a high-level block diagram of a typical server. 

[38] Figure 3 is a flowchart diagram of the method operations of a typical encrypted data 
transaction within the server. 

[39] Figure 4 shows a single CPU die (chip) in accordance with one embodiment of the 
20 present invention. 

|40J Figure 5 shows a detailed view of the processor core and cryptographic co-processor 
in accordance with one embodiment of the present invention. 

[41] Figure 6 is a flowchart 600 of the method operations of the paired processor 410 and 
crypto co-processor 420 according to one embodiment of the present invention. 

25 
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Detailed Description of the Exemplary Embodiments 

[42] Several exemplary embodiments for a system and method for increased and/or more 
efficient data encryption and decryption process speeds will now be described. It will be 
5 apparent to those skilled in the art thai the present invention may be practiced without some 
or all of the specific details set forth herein. 

[43] Figure 4 shows a single CPU die (chip) 400 in accordance with one embodiment of 
the present invention. The CPU chip 400 includes a processing core 410. The processing 
core 410 is paired with a cryptographic co-processor 420. The cryptographic co-processor 

10 420 is optimized to minimize the amount of hardware added to the CPU die 400. The CPU 
die 400 can also include additional processing cores 4 1 On and each of the additional 
processing cores 41 On is also paired with a cryptographic co-processor 420n. The processing 
cores 410, 4 1 On are electrically coupled by a processor crossbar 430. The processor crossbar 
430 is a data bus system that provides a common data communication link between the 

1 5 processing cores 4 10, 4 lOn and other common devices that may be accessed by the 
processing cores 410, 410n such as memory systems and input/output (I/O) systems. 

[44] Figure 5 shows a detailed view of the processor core 410 and cryptographic co- 
processor 420 in accordance with one embodiment of the present invention. The processor 
core 410 includes an instruction decode/trap handler unit 504, an instruction execution unit 
20 506, a load store unit 508 and an integer multiplier unit 5 14. An instruction cache 502 is 
coupled to the input of the instruction decode/trap handler unit 504. A data cache 520 is 
coupled to the load store unit 508. The data cache 520 and the instruction cache 502 are also 
coupled to the processor crossbar 430. The data cache 520 can be a level- 1 cache and can be 
between about 4kb and about 64kb in size. 

25 [45] The crypto-coprocessor 420 includes a crypto control register 5 10, a crypto memory 
5 16, a DMA engine 5 1 8 and the crypto co-processor core 512. The crypto control register 
510 stores the settings of the crypto co-processor 420. The settings can include identifying 
the type of encryption or decryption command or operation to be performed. The types of 
encryption or decryption command can include any of the encryption and decryption schemes 
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known in the art. The crypto control register 510 also stores the status of the current crypto 
operations and is accessible by the processor 410 so that the processor 410 can check the 
status of the current crypto operations. The crypto control register 510 is linked to the load 
store unit 508 by a logical link 530, which represents the bi-directional interchange of data 
5 between the load store unit 508 and the crypto control register 510. 

[46] The DMA engine 518 is coupled to the load store unit 508 by a crypto bus 522. The 
DMA engine 5 1 8 provides more direct access to the memory system hierarchy, such as the 
main memory, the data cache 520 and a level-2 cache, that can be accessed via die load store 
unit 508 and, if necessary, the processor crossbar 430. The crypto bus 522 can be as wide as 
10 feasible to enable rapid data transfer between the crypto coprocessor 420 and the processing 
core 410. In one embodiment, the crypto bus 522 is a 64-bit bus. 

[47J The crypto memory 5 1 6 is sufficiently large enough to hold the operands and results 
for a particular crypto operation. By way of example, in an RSA decryption application, the 
crypto memory 5 1 6 is about 1.3 KB, which is large enough for a modular exponentiation on 
15 2048-bit keys. 

[48| As shown in Figure 5, the integer multiplier unit 5 1 4 is included within the crypto 
processor core 512 but directly accessible by the processor core 4 10 by a logical link 532. 
Alternatively, the integer multiplier unit 5 1 4 can be part of the processor core 4 1 0 as long as 
the crypto processor core 512 can directly access the integer multiplier unit 5 14. In this 
20 manner the crypto processor core 512 and the processor core 410 can share the integer 

multiplier unit 514 so as to reduce the space used (i.e., number of devices required) on the 
die. Typical, prior art crypto processors were not included on the die 400 with the processor 
core because the crypto processors consumed too much valuable space on the die that was 
needed more for the processor core. 

25 [49] Sharing several components significantly reduces the "footprint" of the crypto 
processor so as to allow the crypto processor to be placed on the same die 400 as the 
processor core. The integer multiplier unit 514 performs the modular multiply and modular 
exponentiation functions. The processor core 410 only uses the integer multiplier unit 514 
about 2-5% of the time. Therefore the crypto processor 420 can use the integer multiplier 

30 514 about 95-98% of the time without impacting operations within the processor core 410. 



PAGE 12/13 * RCVD AT 1/25/2007 7:28:55 PM [Eastern Standard Time] * SVR:USPTO-EFXRF-2/13 » DNIS:2738300 * CSID:(661) 460-1986 



* DURATION (mm-ss):06-14 



1/25/2007 S:29 PM FROM: (661) 460-1986 Huffman Patent Group, LLC TO: 1-571-273-8300 PAGE: 013 OF 025 



WO 03/036508 PCT/US02/33321 

9 

Moving the integer multiplier unit 5 14 into the crypto processor 420 further streamlines the 
crypto functions of the integer multiplier unit 514. 

[50| The integer multiplier unit 5 14 is capable of performing modular arithmetic such as 
Montgomery multiply functions and exponentiation. A Montgomery multiply function is a 
5 technique for performing modular multiplication on a large integer (e.g. a 2048 bit number) 
using two multiplications rather than a multiplication and a division. The integer multiplier 
unit 514 can be a 64-bit X 64-bit integer multiplier unit. A 64-bit X 64-bit integer 
multiplication unit can directly access operands that are stored in the crypto memory 516 
rather than flooding the crypto bus 522 and the load store unit 508 every clock cycle. 

10 Flooding the crypto bus 522 and the load store unit 508 every clock cycle would effectively 
stall the processor core 410 because the load store unit 508 would only be able to address the 
demands of the crypto processor 420. Having a crypto memory 516 that is sufficiently large 
enough to perform a complete modular exponentiation relieves the data throughput load on 
the crypto bus 522 and the load store unit 508 and thereby allows the processor core 410 and 

15 the crypto processor 420 to operate in simultaneously on different operations and functions 
for many clock cycles. 

[51] Figure 6 is a flowchart 600 of the method operations of the paired processor 410 and 
crypto co-processor 420 according to one embodiment of the present invention. The 
instruction cache 502 temporarily stores the next instruction to be executed in the processor 
20 core 410. In operation 605 the next instruction is received in the instruction decode/trap 
handler unit 504 from the instruction cache 502. 

[521 The received instruction is forwarded to the instruction execution unit 506 for 
execution in operation 610. The instruction execution unit 506 analyzes the received 
instruction to determine if the received instruction is a load or a store instruction in operation 
25 615. 

|53| If, in operation 615 above, the received instruction is not a load or store instruction, 
the instruction is executed as required in the various stages 504, 506, 508 of the processor 
core 410 as necessary to complete execution of the non-load/non-store instruction, in 
operation 620 and the method operations on the executed instruction result ends. 
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